Take-Home Exercise 3

Author

Tan Wen Yang

Published

March 11, 2023

Modified

March 11, 2023

Loading in all the packages and data

NOTE: TAKE THE PROF’s MPSZ 2019 DATA FROM INCLASS-EX 09!!!! Rather than 2014

Compile to rds once the data wrangling part is done.

Geographically weighted methods (REgression / Random FoREST) and Ordinary Least Square Method (Conventional non-geo-weighted linear regression)

Note that corrplot cannot be loaded (Therefore has to be loaded from package)

pacman::p_load(sf, tidyverse, tmap, sfdep, onemapsgapi, httr, jsonlite, olsrr, ggpubr, GWmodel, dotenv, matrixStats, spdep, SpatialML, Metrics, rsample)
package 'matrixStats' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\Wen Yang\AppData\Local\Temp\RtmpeSkHnj\downloaded_packages
package 'Metrics' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\Wen Yang\AppData\Local\Temp\RtmpeSkHnj\downloaded_packages

Geospatial Data:

Singapore National Boundary and Master Plan 2019 subzone!

  • Singapore National Boundary is a polygon feature data showing the national boundary!

  • Master Plan 2019 subzone are information on URA 2019!

mpsz <- st_read(dsn="data/geospatial", "MPSZ-2019")
Reading layer `MPSZ-2019' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
national_boundary <- st_read(dsn="data/geospatial", "CostalOutline")
Reading layer `CostalOutline' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 60 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 2663.926 ymin: 16357.98 xmax: 56047.79 ymax: 50244.03
Projected CRS: SVY21

Aspatial Data

  • Resale Flat prices

    • Training dataset from 1st January 2021 to 31st December 2022

    • Test dataset from 1 January to the last day of February 2023 resale prices For our

For our case, we will be using four-room flats for our prediction

resale_flat <- read_csv("data/aspatial/resale-flat-prices-from-jan-2017-onwards.csv")

Locational Factors:

For our assignment, we will need to look at location for

  • Proximity to CBD

  • Proximity to Eldercare

  • Proximity to Food court/Hawker

  • Proximity to MRT

  • Proximity to Park

  • Proximity to ‘Good’ primary schools

  • Proximity to Shopping Malls

  • Proximity to supermarket

  • Number of Kindergartens within 350m

  • Number of childcare centers within 350m

  • Number of primary schools within 1km

We start by sourcing some of this data

Geospatial Locational Source:

MRT/LRT and Bus Stop Locations

mrt_lrt <- st_read(dsn="data/geospatial", layer="RapidTransitSystemStation")
Reading layer `RapidTransitSystemStation' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
bus_stop <- st_read(dsn="data/geospatial", layer="BusStop")
Reading layer `BusStop' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21

There is something wrong, the MRT/LRT dataset is giving us polygon type rather than point. Despite the dataset saying it’s point.

OneMapSG Service API

We can extract the following according to their theme on OneMapSG API

  • Childcare (childcare)

  • Eldercare (eldercare)

  • Hawker Center (Queryname: hawkercentre)

  • Kindergarten (Queryname: Kindergartens)

  • Parks (Queryname: nationalparks)

Extra if we have time to think about:

  • Libraries (Queryname: libraries) [NUMBER WITHIN 350M]

  • Integrated Screening Programmes (queryname: moh_isp_clinics)

  • Tourist Attractions (Queryname: tourism) [NUMBER WITHIN 1KM]

Process of going through to create the shp file:

Courtesy of Megan’s work. (The following code chunk will be repeated to create the shp file for all the themes above).

#load_dot_env(file=".env")
#token <- Sys.getenv("TOKEN")
#themetibble <- get_theme(token, "themename")
#themetibble
#themesf <- st_as_sf(themetibble, coords=c("Lng", "Lat"), crs=4326)
#st_write(themesf, "themename.shp")

Load in all the shp data

childcare <- st_read(dsn="data/geospatial", layer="childcare")
Reading layer `childcare' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 1925 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6878 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
Geodetic CRS:  WGS 84
eldercare <- st_read(dsn="data/geospatial", layer="eldercare")
Reading layer `eldercare' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.7119 ymin: 1.271472 xmax: 103.9561 ymax: 1.439561
Geodetic CRS:  WGS 84
hawker_centre <- st_read(dsn="data/geospatial", layer="hawkercentre")
Reading layer `hawkercentre' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 125 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
Geodetic CRS:  WGS 84
kindergarten <- st_read(dsn="data/geospatial", layer="kindergartens")
Reading layer `kindergartens' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 448 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6887 ymin: 1.247759 xmax: 103.9717 ymax: 1.455452
Geodetic CRS:  WGS 84
parks <- st_read(dsn="data/geospatial", layer="nationalparks")
Reading layer `nationalparks' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 421 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
Geodetic CRS:  WGS 84
libraries <- st_read(dsn="data/geospatial", layer="libraries")
Reading layer `libraries' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 31 features and 13 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.7045 ymin: 1.263922 xmax: 103.9494 ymax: 1.448197
Geodetic CRS:  WGS 84
isp_clinics <- st_read(dsn="data/geospatial", layer="moh_isp_clinics")
Reading layer `moh_isp_clinics' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 378 features and 15 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6907 ymin: 1.26397 xmax: 103.9903 ymax: 1.456037
Geodetic CRS:  WGS 84
tourism <- st_read(dsn="data/geospatial", layer="tourism")
Reading layer `tourism' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 109 features and 15 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.684 ymin: 1.223 xmax: 103.974 ymax: 1.447
Geodetic CRS:  WGS 84

External Sources

  • Supermarkets (Source: Dataportal)

  • Primary School (Wikipedia, Reverse Geocoded using OneMap API)

  • Top Primary Schools (We will pick the top 10 based off SchLah’s dataset)

  • Malls (Taken from this dataset) [NOTE: SVY21]

supermarkets <- st_read("data/geospatial/supermarkets.kml")
Reading layer `SUPERMARKETS' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial\supermarkets.kml' 
  using driver `KML'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

Load Primary School Data with Lat/Long

The lat/long are taken from OneMapSG API

Top 10 schools as well, dervied from schlah

primary_sch <- read_csv("data/aspatial/primary_school_geo.csv")
top10_pri_sch <- read_csv("data/aspatial/top_10_primary_school_geo.csv")

Load Shopping Malls

shopping_mall <- read_csv("data/aspatial/shopping_mall.csv")

Data Wrangling (Geospatial)

List of task to do:

  • Convert some remaining data to Geospatial

  • Convert multipoint to point data (Removing the ‘z’) for supermarket

  • Convert all if not already in SVY21 into SVY21 (3414)

  • Remove all unnecessary columns

  • Check for null values

Convert all of our datasets from CSV to Geospatial (shp)

  • Primary Schools

  • Shopping Malls

Within primary school, the OneMapAPI is not able to properly return the correct coordinate as the name is ‘CATHOLIC HIGH SCHOOL (PRIMARY)’. For it to return the correct coordinate, the name has to be truncated without (PRIMARY).

primary_sch_sf <- st_as_sf(primary_sch, coords=c("LONG", "LAT"), crs=4326)
top_primary_sch_sf <- st_as_sf(top10_pri_sch, coords=c("LONG", "LAT"), crs=4326)
shopping_mall_sf <- st_as_sf(shopping_mall, coords=c("longitude", "latitude"), crs=4326)

Remove ‘Z-Dimension’ for Supermarket Data

supermarkets <- st_zm(supermarkets)

Check and change all EPSG for Geospatial data to ESPG 3414

childcare3414 <- st_transform(childcare, crs=3414)
eldercare3414 <- st_transform(eldercare, crs=3414)
hawker_centre3414 <- st_transform(hawker_centre, crs=3414)
kindergarten3414 <- st_transform(kindergarten, crs=3414)
parks3414 <- st_transform(parks, crs=3414)
libraries3414 <- st_transform(libraries, crs=3414)
isp_clinics3414 <- st_transform(isp_clinics, crs=3414)
tourism3414 <- st_transform(tourism, crs=3414)
primary_sch_sf_3414 <- st_transform(primary_sch_sf, crs=3414)
top_primary_sch_sf_3414 <- st_transform(top_primary_sch_sf, crs=3414)
shopping_mall_sf_3414 <- st_transform(shopping_mall_sf, crs=3414)

Check others:

st_crs(primary_sch_sf_3414)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(top_primary_sch_sf_3414)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(shopping_mall_sf_3414)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_crs(supermarkets)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

Looks like Supermarket is still in WGS84, let’s fix that

supermarkets3414 <- st_transform(supermarkets, crs=3414)

Check to make sure…

st_crs(supermarkets3414)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Remove all unnecessary columns

After glimpsing at the sf dataframe, we are going to remove all useless columns.

  • Information such as addresses/descriptions are not necessary, as all we need are the point data

  • We will keep the name for identification purposes

We will keep name and geometry

childcare3414 <- childcare3414 %>%
  select(c('NAME', 'geometry'))

Check result

childcare3414
Simple feature collection with 1925 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11810.03 ymin: 25596.33 xmax: 45404.24 ymax: 49300.88
Projected CRS: SVY21 / Singapore TM
First 10 features:
                                             NAME                  geometry
1  APOLLO INTERNATIONAL PRESCHOOL PRIVATE LIMITED POINT (40985.94 33848.38)
2                    APPLE TREE PLAYHOUSE PTE LTD POINT (28308.65 45530.47)
3  Appleland Montessori Child Care Centre Pte Ltd POINT (17828.84 36607.36)
4                             APPLELAND PLAYHOUSE POINT (25579.73 29221.89)
5              APRICOT ACADEMY (LAGUNA) PTE. LTD. POINT (38981.02 32483.41)
6                                 Arise Preschool    POINT (21588.47 36307)
7        Artemis Preskool @ Tampines Pte Ltd (CC)  POINT (39239.78 37501.4)
8                    Artemis Preskool @ Woodleigh POINT (32389.52 35403.72)
9                     ARTS JUNIOR MONTESSORI  LLP POINT (25554.36 30090.82)
10                   Arts Kidz Pre-School Pte Ltd    POINT (28004.17 28442)

Same thing, name and geometry

eldercare3414 <- eldercare3414 %>%
  select(c('NAME', 'geometry'))

Check result

eldercare3414
Simple feature collection with 133 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21 / Singapore TM
First 10 features:
                                                           NAME
1                                  Yuhua Senior Activity Centre
2                                          THK SAC @ Kaki Bukit
3                                            THK SAC @ Boon Lay
4                        PEACE-Connect Senior Activity Centre@5
5                                        THK SAC @ Beo Crescent
6                                      Silver ACE @ Bukit Merah
7  Lions Befrienders Senior Activity Centre @ Tampines Blk 499C
8                    Care Corner Senior Activity Centre (WL569)
9           Fei Yue Senior Activity Centre (Bukit Batok Branch)
10       COMNET Senior Activity Centre @ 182 Rivervale Crescent
                    geometry
1  POINT (16614.08 36639.12)
2  POINT (38803.81 35098.77)
3  POINT (14481.92 36357.61)
4  POINT (31505.35 31853.52)
5  POINT (27218.35 30135.49)
6  POINT (27278.94 29350.17)
7  POINT (41665.14 37956.92)
8  POINT (23147.94 45761.17)
9  POINT (18820.58 36396.32)
10  POINT (36446.37 41376.9)
hawker_centre3414 <- hawker_centre3414 %>%
  select(c('NAME', 'geometry'))
hawker_centre3414
Simple feature collection with 125 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 12874.19 ymin: 28355.97 xmax: 45241.4 ymax: 47850.43
Projected CRS: SVY21 / Singapore TM
First 10 features:
                                   NAME                  geometry
1           Market Street Hawker Centre  POINT (29874.82 29650.7)
2          Marsiling Mall Hawker Centre POINT (22042.51 46139.03)
3          Margaret Drive Hawker Centre  POINT (24816.7 31094.91)
4       Fernvale Hawker Centre & Market  POINT (32867.9 41500.77)
5             One Punggol Hawker Centre POINT (35955.52 43336.13)
6          Bukit Canberra Hawker Centre POINT (26794.39 47850.43)
7                   Senja Hawker Centre POINT (19953.85 41008.06)
8                Buangkok Hawker Centre POINT (34575.37 40482.99)
9        Bukit Batok West Hawker Centre  POINT (17817.9 37478.45)
10 Telok Blangah Hawker Centre & Market     POINT (25191 28414.9)
isp_clinics3414 <- isp_clinics3414 %>%
  select(c('NAME', 'geometry'))
isp_clinics3414
Simple feature collection with 378 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 12130.5 ymin: 27388.9 xmax: 45475.65 ymax: 48626.7
Projected CRS: SVY21 / Singapore TM
First 10 features:
                                  NAME                  geometry
1             HEALTHWAY MEDICAL CLINIC POINT (35625.67 40809.19)
2             HEALTHWAY MEDICAL CLINIC  POINT (31090.4 33520.89)
3             HEALTHWAY MEDICAL CLINIC POINT (40903.43 39446.08)
4             HEALTHWAY MEDICAL CLINIC POINT (41384.47 37152.14)
5             HEALTHWAY MEDICAL CLINIC POINT (27371.14 45806.89)
6             HEALTHWAY MEDICAL CLINIC POINT (28014.68 45473.31)
7             HEALTHWAY MEDICAL CLINIC POINT (26462.45 47759.61)
8  HEALTHWISE MEDICAL CLINIC & SURGERY POINT (33725.38 32238.48)
9                    HL CLINIC PTE LTD  POINT (26451.91 28874.8)
10           HO MEDICAL CENTRE PTE LTD POINT (21103.55 32951.85)
kindergarten3414 <- kindergarten3414 %>%
  select(c('NAME', 'geometry'))
kindergarten3414
Simple feature collection with 448 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11909.7 ymin: 25596.33 xmax: 43395.47 ymax: 48562.06
Projected CRS: SVY21 / Singapore TM
First 10 features:
                                                         NAME
1  PCF Sparkletots Preschool @ Cheng San-Seletar Blk 435 (KN)
2  PCF Sparkletots Preschool @ Cheng San-Seletar Blk 533 (KN)
3  PCF Sparkletots Preschool @ Cheng San-Seletar Blk 556 (DS)
4         PCF Sparkletots Preschool @ Chong Pang Blk 107 (KN)
5         PCF Sparkletots Preschool @ Chong Pang Blk 122 (KN)
6       PCF Sparkletots Preschool @ Chua Chu Kang Blk 10 (KN)
7           PCF Sparkletots Preschool @ Clementi Blk 330 (DS)
8              PCF Sparkletots Preschool @ Eunos Blk 616 (KN)
9           PCF Sparkletots Preschool @ Fengshan Blk 126 (DS)
10         PCF Sparkletots Preschool @ Fernvale Blk 416A (DS)
                    geometry
1  POINT (30325.45 38859.25)
2  POINT (30190.51 39574.18)
3   POINT (30705.05 39337.9)
4  POINT (27354.73 45992.92)
5  POINT (27755.87 46300.26)
6  POINT (19307.62 40271.08)
7  POINT (20706.39 32892.83)
8   POINT (37089.3 34892.34)
9  POINT (39752.13 34487.87)
10 POINT (33190.98 41392.37)
libraries3414 <- libraries3414 %>%
  select(c('NAME', 'geometry'))
libraries3414
Simple feature collection with 31 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 13665.24 ymin: 27383.57 xmax: 40922.89 ymax: 47759.75
Projected CRS: SVY21 / Singapore TM
First 10 features:
                           NAME                  geometry
1  Choa Chu Kang Public Library POINT (18175.64 40767.45)
2      Punggol Regional Library POINT (36014.52 43461.09)
3     Ang Mo Kio Public Library POINT (29364.62 39642.82)
4          Bedok Public Library POINT (38947.87 34357.48)
5         Bishan Public Library POINT (29727.41 36885.15)
6    Bukit Batok Public Library POINT (18646.83 36869.57)
7  Bukit Panjang Public Library POINT (20324.67 40211.98)
8        Central Public Library POINT (30327.79 31087.66)
9      Cheng San Public Library  POINT (34728.1 39397.71)
10      Clementi Public Library  POINT (20319.52 33038.5)

We only keep STN_NAM_DE and geometry

Note that the file is in polygon format, rather than point. Might have to deal with it.

https://stackoverflow.com/questions/52522872/r-sf-package-centroid-within-polygon

mrt_lrt <- mrt_lrt %>%
  select(c('STN_NAM_DE', 'geometry'))

Check

mrt_lrt
Simple feature collection with 220 features and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
First 10 features:
                STN_NAM_DE                       geometry
1    ESPLANADE MRT STATION POLYGON ((30566.07 30621.21...
2   PAYA LEBAR MRT STATION POLYGON ((34495.6 33384.44,...
3  DHOBY GHAUT MRT STATION POLYGON ((29293.51 31312.53...
4       DAKOTA MRT STATION POLYGON ((34055.08 32290.62...
5     LAVENDER MRT STATION POLYGON ((31236.5 32085.76,...
6      RENJONG LRT STATION POLYGON ((34382.66 40949.64...
7        DOVER MRT STATION POLYGON ((21987.25 32576.91...
8      HOUGANG MRT STATION POLYGON ((34585.74 39132.47...
9      PHOENIX LRT STATION POLYGON ((19602.92 40048.64...
10    ALJUNIED MRT STATION POLYGON ((33422.35 33158.32...
parks3414 <- parks3414 %>%
  select(c('NAME', 'geometry'))
parks3414
Simple feature collection with 421 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 12374.75 ymin: 21917.81 xmax: 52533.09 ymax: 49296.46
Projected CRS: SVY21 / Singapore TM
First 10 features:
                    NAME                  geometry
1      ENG KONG PLACE FC POINT (20647.02 35214.35)
2         JANGGUS GARDEN POINT (28409.23 48791.68)
3     JLN LIMAU MANIS PG POINT (40880.22 34107.22)
4         GARDEN VIEW PG POINT (31600.01 38049.84)
5       THOMSON GREEN PG POINT (27919.96 40074.49)
6           JLN RIANG PG POINT (31789.37 36584.45)
7   MEI HWAN CRESCENT PG POINT (31162.69 37212.14)
8          FULTON AVE PG   POINT (28026.6 38242.5)
9      MIMOSA TERRACE PG POINT (31016.83 40664.48)
10 JLN GENENG INTERIM PK  POINT (33295.7 37270.28)
primary_sch_sf_3414 <- primary_sch_sf_3414 %>%
  select(c('Primary school', 'geometry'))
primary_sch_sf
Simple feature collection with 181 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6878 ymin: 1.274958 xmax: 103.9628 ymax: 1.456608
Geodetic CRS:  WGS 84
# A tibble: 181 × 6
   `Primary school`         Type  Gender Locat…¹ TAG              geometry
 * <chr>                    <chr> <chr>  <chr>   <chr>         <POINT [°]>
 1 Admiralty Primary School Gove… Mixed  Woodla… ADMI…    (103.8 1.442635)
 2 Ahmad Ibrahim Primary S… Gove… Mixed  Yishun  AHMA… (103.8329 1.433153)
 3 Ai Tong School           Gove… Mixed  Bishan  AI T…  (103.833 1.360583)
 4 Alexandra Primary School Gove… Mixed  Bukit … ALEX… (103.8244 1.291334)
 5 Anchor Green Primary Sc… Gove… Mixed  Sengka… ANCH…  (103.8872 1.39037)
 6 Anderson Primary School  Gove… Mixed  Ang Mo… ANDE… (103.8414 1.384264)
 7 Ang Mo Kio Primary Scho… Gove… Mixed  Ang Mo… ANG … (103.8396 1.369322)
 8 Anglo-Chinese School (J… Gove… Boys   Central ANGL…   (103.841 1.30935)
 9 Anglo-Chinese School (P… Gove… Boys   Bukit … ANGL… (103.8356 1.318371)
10 Angsana Primary School   Gove… Mixed  Tampin… ANGS… (103.9518 1.348553)
# … with 171 more rows, and abbreviated variable name ¹​Location
supermarkets3414 <- supermarkets3414 %>%
  select(c('Name', 'geometry'))
supermarkets3414
Simple feature collection with 526 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 4901.188 ymin: 25529.08 xmax: 46948.22 ymax: 49233.6
Projected CRS: SVY21 / Singapore TM
First 10 features:
     Name                  geometry
1   kml_1 POINT (35561.22 42685.17)
2   kml_2 POINT (32184.01 32947.46)
3   kml_3 POINT (33903.48 39480.46)
4   kml_4 POINT (37083.82 35017.47)
5   kml_5  POINT (41320.3 37283.82)
6   kml_6 POINT (41384.47 37152.14)
7   kml_7 POINT (30186.63 38602.77)
8   kml_8 POINT (28380.83 38842.16)
9   kml_9 POINT (34383.76 37311.19)
10 kml_10 POINT (29010.23 45755.51)
tourism3414 <- tourism3414 %>%
  select(c('NAME', 'geometry'))
tourism3414
Simple feature collection with 109 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11380.23 ymin: 22858.67 xmax: 43659.54 ymax: 47627.69
Projected CRS: SVY21 / Singapore TM
First 10 features:
                                                       NAME
1                                National Gallery Singapore
2                   Sultan Mosque (Masjid Sultan) Singapore
3           Sri Mariamman Temple: Hindu Temple in Singapore
4                              Armenian Church in Singapore
5                                         CHIJMES Singapore
6  St Andrewâ??s Cathedral- Singapore Architecture Landmark
7                                         Kreta Ayer Square
8                                  Albert Mall Trishaw Park
9                                     Chinatown Food Street
10                     Chinatown Heritage Centre, Singapore
                    geometry
1  POINT (30007.41 30267.17)
2  POINT (30877.14 31594.08)
3  POINT (29342.34 29382.57)
4   POINT (29818.66 30598.9)
5  POINT (30043.47 30820.05)
6  POINT (30113.58 30488.32)
7   POINT (29142.6 29293.89)
8  POINT (30244.32 31323.04)
9  POINT (29165.65 29443.75)
10 POINT (29227.71 29603.72)

Dealing with the anomaly - MRT/LRT

Despite it saying that’s point data, it gives polygon data, therefore we need to convert the polygon into point data. Unfortunately, there appears to be an issue that causes st_make_valid to not work, due to missing data within the polygon (There is an ‘na’).

Thus, we’ll take this chance to do some fine tuning and make use of OneMapSG to return us the correct coordinates. We will also remove all ‘depots’ as they are not considered MRT/LRTs for our dataset.

mrt_lrt_new_set <- read_csv("data/aspatial/mrt_lrt_lat_long.csv")

Convert into an SF Object

mrt_lrt_ns_sf <- st_as_sf(mrt_lrt_new_set, coords=c("LONG", "LAT"), crs=4326)
mrt_lrt_ns_sf <- mrt_lrt_ns_sf %>%
  select(c('STN_NAM_DE', 'geometry'))
mrt_lrt_ns_sf_3414 <- st_transform(mrt_lrt_ns_sf, crs=3414)

View the map to make sure the mapping is correct

tmap_mode("view")+
  tm_shape(bus_stop)+
  tm_dots(col="purple", size=0.05)

Looks like we have some bus stations in Malaysia, which makes sense, as there are some Singapore buses that travel into Malaysia.

These stops are:

  • LARKIN TERMINAL (Malaysia) [46239]

  • KOTARAYA II TER [46609]

  • JB SENTRAL [47701]

  • JOHOR BAHRU CHECKPOINT (46211, 46219)

We don’t want them in our calculation, so we’ll remove them.

bus_stop_to_remove <- c(46239, 46609, 47701, 46211, 46219)
bus_stop_cleaned <- filter(bus_stop, !(BUS_STOP_N %in% bus_stop_to_remove))

Check the map again:s

tm_shape(bus_stop_cleaned)+
  tm_dots(col="purple", size=0.05)

Convert Bus Stop from WGS84 to SVY21

bus_stop_cleaned_3414 <- st_transform(bus_stop_cleaned, crs = 3414)
st_crs(bus_stop_cleaned_3414)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Data Wrangling (Aspatial)

We will strictly be looking at Four Room Flats

  • Training Set: 1st January 2021 - 31st December 2022

  • Test Set: January - February 2023

resale_flat_sub <-  filter(resale_flat,flat_type == "4 ROOM") %>% 
              filter(month >= "2021-01" & month <= "2023-02")

Transforming the Data into usable data

Looking at the data, we will now need to combine both the block and street name to form the address so we can lookup for the postal code and retrieve the LAT/LONG

We will also need to calculate the remaining number of lease. Given that the remaining lease are in ‘YEAR’ + ‘MONTH’ format, we need to change it into a continuous number. We will change it to ‘MONTHS’.

rs_modified <- resale_flat_sub %>%
  mutate(resale_flat_sub, address = paste(block,street_name)) %>%
  mutate(resale_flat_sub, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
  mutate(resale_flat_sub, remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))

Calculate the LEASE in Months and remove all intermediary columns

  • Convert all NA into 0

  • Sum up YEAR and MONTH and place them into a new column

  • Select only the columns we need.

rs_modified$remaining_lease_mth[is.na(rs_modified$remaining_lease_mth)] <- 0

rs_modified$remaining_lease_yr <- rs_modified$remaining_lease_yr * 12
rs_modified <- rs_modified %>% 
  mutate(remaining_lease_mths = rowSums(rs_modified[, c("remaining_lease_yr", "remaining_lease_mth")])) %>%
  select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model, 
         lease_commence_date, remaining_lease_mths, resale_price)

Glimpse at our data:

glimpse(rs_modified)

Retrieving the LAT/LONG Coordinates from OneMap.sg API

The following solution is courtesy of: Nor Aisyah

We first create a unique list of addresses

add_list <- sort(unique(rs_modified$address))

Now we will focus on getting the data needed.

In the following code chunk, the goal is to retrieve coordinates for a given search value using OneMap SG’s REST APIs. The code creates a dataframe to store the final retrieved coordinates and uses the httr package’s GET() function to make a GET request to https://developers.onemap.sg/commonapi/search.

The required variables to be included in the GET request are searchVal, returnGeom {Y/N}, and getAddrDetails {Y/N}.

The JSON response returned will contain multiple fields, but the code is only interested in the postal code and coordinates like Latitude & Longitude. A new dataframe is created to store each final set of coordinates retrieved during the loop, and the number of responses returned is checked to append to the main dataframe accordingly. Invalid addresses are also checked, and the response is appended to the main dataframe using rbind() function of base R package.

retrieve_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://developers.onemap.sg/commonapi/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

We call the function to get the coordinates

coords <- retrieve_coords(add_list)

Inspect the results

coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ]

From the results, we can see that it has missing postal code. However, the postal code is not as relevant to our analysis as latitude and longitude.

The missing respective postal code are:

  • 680215

  • 680216

rs_coords <- left_join(rs_modified, coords, by = c('address' = 'address'))

Write file to RDS

write_rds(rs_coords, "data/aspatial/rs_coords.rds")

Read the RDS file

resale_sub_flat <- read_rds("data/aspatial/rs_coords.rds")

Assign and transform CRS

Data from OneMapSG are in WGS84, as evident from the decimal. So we will create a SF object in that code before transforming into SVY21

resale_coords_sf <- st_as_sf(resale_sub_flat,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)
st_crs(resale_coords_sf)

Check if there are any NA outside of what we confirmed earlier

rows_with_na <- which(is.na(resale_coords_sf), arr.ind=TRUE)
if (nrow(rows_with_na) > 0) {
  message("The following rows have NA values:")
  print(rows_with_na)
} else {
  message("The dataframe does not contain any rows with NA values.")
}

Remove all unnecessary rows

We only need to keep the identifier, time, resale price, spatial point data, and the structural factors as listed below

  • Area of Unit

  • Floor Level

  • Remaining Lease

  • Age of Unit

Note that we have not handled the age of unit yet. We’ll handle that later.

trim_resale_flat <- resale_coords_sf %>%
  select(1:3, 6:8, 11:12)

Determining CBD

We need to consider the distance to CBD as well. We will take ‘Downtown Core’ as our reference point, which is located in the southwest of Singapore.

cbd_lat <- 1.287953
cbd_long <- 103.851784

cbd_sf <- data.frame(cbd_lat, cbd_long) %>%
  st_as_sf(coords = c("cbd_long", "cbd_lat"), crs=4326) %>%
  st_transform(crs=3414)

Proximity Distance Calculation

For our next step, we will need to integrate all of our geospatial data by calculating the proximity to the amenities.

The following function calculates proximity

  • rowMins from matrixStats package finds the shortest possible distance within the distance matrix.
proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    units::drop_units()
  df1[,varname] <- rowMins(dist_matrix)
  return(df1)
}

Implement the Proximity Calculations:

trim_resale_flat <- proximity(trim_resale_flat, cbd_sf, "PROX_CBD") %>%
  proximity(., eldercare3414, "PROX_ELDERCARE") %>%
  proximity(., hawker_centre3414, "PROX_HAWKER") %>%
  proximity(., isp_clinics3414, "PROX_ISP") %>%
  proximity(., mrt_lrt_ns_sf_3414, "PROX_MRT") %>%
  proximity(., parks3414, "PROX_PARKS") %>%
  proximity(., top_primary_sch_sf_3414, "PROX_TOP_PRI") %>%
  proximity(., shopping_mall_sf_3414, "PROX_SHOPPING") %>%
  proximity(., supermarkets3414, "PROX_SUPERMARKETS")

Facility Count within Radius

In addition to calculating the shortest distance between points, we’re also interested in finding out how many facilities are located within a specific radius. To accomplish this, we’ll use the st_distance() function to calculate the distance between the flats and the desired facilities. We’ll then use rowSums() to add up the observations and obtain the count of facilities within the desired radius. The resulting values will be added to the data frame as a new column.

num_radius <- function(df1, df2, varname, radius) {
  dist_matrix <- st_distance(df1, df2) %>%
    units::drop_units() %>%
    as.data.frame()
  df1[,varname] <- rowSums(dist_matrix <= radius)
  return(df1)
}

Implement the function:

trim_resale_flat <- 
  num_radius(trim_resale_flat, kindergarten3414, "NUM_KINDERGARTEN", 350) %>%
  num_radius(., childcare3414, "NUM_CHILDCARE", 350) %>%
  num_radius(., bus_stop_cleaned_3414, "NUM_BUS_STOPS", 350) %>%
  num_radius(., primary_sch_sf_3414, "NUM_PRI_SCHS", 1000) %>%
  num_radius(., tourism3414, "NUM_TOURIST_SITES", 1000)

Save the Data as RDS

Before we do that, we should trim the name, as when converting to RDS the column names are shortened due to limitations

trim_resale_flat <- trim_resale_flat %>%
  mutate() %>%
  rename("AREA_SQM" = "floor_area_sqm", 
         "LEASE_MTHS" = "remaining_lease_mths", 
         "PRICE"= "resale_price",
         "FLAT" = "flat_type", 
         "STOREY" = "storey_range")
st_write(trim_resale_flat, "data/geospatial/resale_flat_final.shp")

Reimport by reloading our dataset

resale_flat_sf <- st_read(dsn="data/geospatial", layer="resale_flat_final") 
Reading layer `resale_flat_final' from data source 
  `D:\Users\Wen Yang\Documents\wyt05\IS415-GAA-WY\take-home-ex\take-home-ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 25503 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11519.79 ymin: 28217.39 xmax: 42645.18 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM

We’re done with gathering and creating our data! Now we can move onto our exploratory data analysis and building our prediction models. Note that there are still some data that need to be modified. Such as ordinal data like Storeys and also the calculation of the flat’s age.

Calculate age of unit

To calculate the age of unit, we will take 2023 (current year) - lease_commence_date to find out the age of the flat. Then we’ll append it into our new sf object.

resale_flat_sub_age <- resale_flat_sub %>%
  mutate("age_of_unit" = 2023 - lease_commence_date)
resale_flat_sf <- resale_flat_sf %>%
  mutate("age_of_unit" = resale_flat_sub_age$age_of_unit)

Speaking of which, we’ll handle them now.

Storey Data:

It’s a categorical data that has meaning when ordered. Higher/lower levels could potentially have an impact on the price of the HDB Flat. Therefore, we should convert this column of data into numbers, one that the regression can learn from.

storeys <- sort(unique(resale_flat_sf$STOREY))
storey_order <- 1:length(storeys)
storey_range_order <- data.frame(storeys, storey_order)
storey_range_order
    storeys storey_order
1  01 TO 03            1
2  04 TO 06            2
3  07 TO 09            3
4  10 TO 12            4
5  13 TO 15            5
6  16 TO 18            6
7  19 TO 21            7
8  22 TO 24            8
9  25 TO 27            9
10 28 TO 30           10
11 31 TO 33           11
12 34 TO 36           12
13 37 TO 39           13
14 40 TO 42           14
15 43 TO 45           15
16 46 TO 48           16
17 49 TO 51           17

From the above, we can see that the storey values have been sorted from 1 - 10

We will combine this data with the base resale SF. Unfortunately, we have to drop the geometric column first before we can perform the join. So we’ll do that, then we’ll combine both the columns

geom_temp_store <- st_geometry(resale_flat_sf)
temp_no_geo <- st_drop_geometry(resale_flat_sf)
temp_join <- dplyr::left_join(temp_no_geo, storey_range_order, by=c("STOREY"= "storeys"))
final_resale_output <- st_as_sf(temp_join, geom_temp_store)
final_resale_output
Simple feature collection with 25503 features and 24 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11519.79 ymin: 28217.39 xmax: 42645.18 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
First 10 features:
     month       town               address   FLAT   STOREY AREA_SQ LEASE_M
1  2021-01 ANG MO KIO 547 ANG MO KIO AVE 10 4 ROOM 04 TO 06      92    8496
2  2021-01 ANG MO KIO 414 ANG MO KIO AVE 10 4 ROOM 01 TO 03      92    8217
3  2021-01 ANG MO KIO  509 ANG MO KIO AVE 8 4 ROOM 01 TO 03      91    8358
4  2021-01 ANG MO KIO 467 ANG MO KIO AVE 10 4 ROOM 07 TO 09      92    8219
5  2021-01 ANG MO KIO  571 ANG MO KIO AVE 3 4 ROOM 07 TO 09      92    8213
6  2021-01 ANG MO KIO  134 ANG MO KIO AVE 3 4 ROOM 07 TO 09      98    8073
7  2021-01 ANG MO KIO  204 ANG MO KIO AVE 3 4 ROOM 07 TO 09      92    7921
8  2021-01 ANG MO KIO  429 ANG MO KIO AVE 3 4 ROOM 04 TO 06      92    8074
9  2021-01 ANG MO KIO 413 ANG MO KIO AVE 10 4 ROOM 10 TO 12      92    8216
10 2021-01 ANG MO KIO 415 ANG MO KIO AVE 10 4 ROOM 07 TO 09      92    8216
    PRICE  PROX_CB    PROX_EL  PROX_HA   PROX_IS   PROX_MR  PROX_PA   PROX_TO
1  370000 9564.575 1085.67795 444.2515 358.32528 1072.6778 752.0454 1874.4065
2  375000 8401.690  150.39052 205.0009 149.98683  824.1874 648.4302 1429.2226
3  380000 9516.492  722.42472 449.5734  95.43992  454.4549 367.7768 1768.9221
4  385000 8580.908   98.16285 319.0679 211.20843  949.9420 390.6401 1775.9121
5  410000 9084.958  592.69462 258.9414 194.85828  592.4801 495.0982 2028.3228
6  410000 9366.153  347.25019 404.5946 129.38684  602.5514 256.6450  870.1812
7  410000 8824.749  251.40646 441.8265 385.33722  688.4964 759.6587 1270.3903
8  418000 8973.505  453.59397 339.3141 260.73764  439.9443 372.5338 1851.5463
9  420000 8359.105  174.57871 206.3100 194.64382  843.7179 672.9028 1380.8247
10 420000 8461.580   85.45355 266.6457 161.49188  758.2938 684.5322 1453.4366
     PROX_SH  PROX_SU NUM_KIN NUM_CHI NUM_BUS NUM_PRI NUM_TOU age_of_unit
1  1216.2884 418.4204       1       3       4       0       0          42
2   844.4986 194.6009       0       3       7       2       0          44
3   560.0964 443.5109       1       4      10       1       0          43
4   930.4149 426.9715       1       4       4       2       0          44
5   719.5547 192.5617       1       4       8       1       0          44
6   791.2458 264.9240       0       2       2       3       0          45
7   546.0414 270.8222       1       7       8       2       0          46
8   551.4709 386.5979       1       4       7       1       0          45
9   856.0013 220.0960       0       3       6       2       0          44
10  781.9466 128.3026       0       3       7       2       0          44
   storey_order           geom_temp_store
1             2 POINT (30770.07 39578.64)
2             1 POINT (30292.01 38439.17)
3             1    POINT (29872 39555.56)
4             3  POINT (30613.6 38603.54)
5             3 POINT (30399.59 39119.25)
6             3 POINT (28960.32 39342.78)
7             3 POINT (29179.92 38822.08)
8             2 POINT (30237.21 39012.48)
9             4 POINT (30265.47 38397.28)
10            3 POINT (30263.23 38499.84)

Rename to Columns so they are more readable

Due to abbreviation from Shapefile limitation, we will rename them back so it’s readable.

final_resale_output <- final_resale_output %>%
  mutate() %>%
  rename("PROX_CBD" = "PROX_CB", 
         "PROX_ELDERCARE" = "PROX_EL", 
         "PROX_HAWKER"= "PROX_HA",
         "PROX_ISP" = "PROX_IS", 
         "PROX_MRT" = "PROX_MR",
         "PROX_PARK" = "PROX_PA",
         "PROX_TOP_PRI" = "PROX_TO",
         "PROX_SHOPPING" = "PROX_SH",
         "PROX_SUPERMARKET" = "PROX_SU",
         "NUM_KINDERGARTEN" = "NUM_KIN",
         "NUM_CHILDCARE" = "NUM_CHI",
         "NUM_BUS_STOPS" = "NUM_BUS",
         "NUM_PRI_SCHS" = "NUM_PRI",
         "NUM_TOURIST_SITES" = "NUM_TOU",
         "geometry" = "geom_temp_store")

Drop ’STOREY”

final_resale_output <- final_resale_output %>%
  select(c(1:4, 6:24))

Exploratory Analysis:

Now that we have finished our data wrangling, we can move into exploratory analysis and prediction.

We will start by visualizing the resale prices

ggplot(data=final_resale_output, aes(x=`PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue") +
    labs(title = "Distribution of Resale Prices",
         x = "Resale Prices",
         y = 'Frequency')

There appears to be a right-skewed distribution. This means that the majority of the housing prices are transacted on the lower end with some extremes.

Histogram Plotting of the Structural Factors

Extract the column names to plot

s_factors <- c("AREA_SQ", "LEASE_M", "age_of_unit", "storey_order")

Create a list to store histograms on structural factors

  • Create a vector of the size of our structural vectors (4 in this case)

  • Plot the histogram for each structural factor

  • Append the histogram to the created vector

s_factor_hist_list <- vector(mode = "list", length = length(s_factors))
for (i in 1:length(s_factors)) {
  hist_plot <- ggplot(final_resale_output, aes_string(x = s_factors[[i]])) +
    geom_histogram(color="black", fill = "light blue") +
    labs(title = s_factors[[i]]) +
    theme(plot.title = element_text(size = 10),
          axis.title = element_blank())
  
  s_factor_hist_list[[i]] <- hist_plot
}

Plot the histograms

We’ll use the ggarrange() function to organize them into a 2x2 column row plot.

ggarrange(plotlist = s_factor_hist_list,
          ncol = 2,
          nrow = 2)

AREA_SQ:

It appears that most four-room flats roam around 90+ square feet. However, there are anomalies where it exceeds 110 square feet.

LEASE_M and Age of Unit

They appear to be inversely correlated. Where there is a large number of four-room flats that are not old and have a long lease life ahead.

Story Order:

There appear to be a familiar pattern where most of the four-room flats are built on lower floors. Though, this might be an effect where most HDBs are just built lower in general, with some anomalies of HDB flats which are built extremely high.

Locational Factors

l_factor <- c("PROX_CBD", "PROX_ELDERCARE", "PROX_HAWKER", "PROX_ISP", "PROX_MRT", "PROX_PARK", "PROX_TOP_PRI", "PROX_SHOPPING", "PROX_SUPERMARKET",
              "NUM_KINDERGARTEN", "NUM_CHILDCARE", "NUM_BUS_STOPS", "NUM_PRI_SCHS", "NUM_TOURIST_SITES")
l_factor_hist_list <- vector(mode = "list", length = length(l_factor))
for (i in 1:length(l_factor)) {
  hist_plot <- ggplot(final_resale_output, aes_string(x = l_factor[[i]])) +
    geom_histogram(color="midnight blue", fill = "light sky blue") +
    labs(title = l_factor[[i]]) +
    theme(plot.title = element_text(size = 10),
          axis.title = element_blank())
  
  l_factor_hist_list[[i]] <- hist_plot
}
ggarrange(plotlist = l_factor_hist_list,
          ncol= 4,
          nrow = 4)

From observation, it appears that ELDERCARE are in the right-skewed in distribution. Which means some areas have a lot, while majority have almost none. Same goes for hawkers, ISP, MRT, PARK, SUPERMARKETS and shopping. This is a good thing, as this means most HDB are close to ISP, MRT, PARKS, SUPERMARKETS and shopping malls.

View the prices on the map:

tmap_mode("view")
tmap_options(check.and.fix = TRUE)
tm_shape(final_resale_output) +  
  tm_dots(col = "PRICE",
          alpha = 0.6,
          style="quantile") +
  # sets minimum zoom level to 11, sets maximum zoom level to 14
  tm_view(set.zoom.limits = c(11,14))

Change back to plot.

tmap_mode("plot")

Correlation Plot

Now, we’ll perform the correlation plot to look at the correlation between the different columns. We will drop columns there is high multicolinearity between them > 0.8,

To do that, we will remove the ‘geom’ from the table.

final_resale_output_nogeom <- final_resale_output %>%
  st_drop_geometry()

View the chart

corrplot::corrplot(cor(final_resale_output_nogeom[, 5:ncol(final_resale_output_nogeom)]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

From the corrplot chart, we can see the the remaining lease is inversely correlated with age of unit. Therefore, it will be excluded for our linear regression.

Split our Data

For our OLS and GWS models, we will be using 1st January 2021 to 31st December 2022 for training set and January and February 2023 for our test set.

Set seed.

set.seed(1234)
resale_flat_train <-filter(final_resale_output, month >= "2021-01" & month <= "2022-12")
resale_flat_test <-filter(final_resale_output, month >= "2023-01" & month <= "2023-02")

Non-Spatial OLS

price_mlr <- lm(PRICE ~ AREA_SQ +
                  storey_order + LEASE_M +
                  PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
                  PROX_ISP + PROX_MRT + PROX_PARK + PROX_TOP_PRI + 
                  PROX_SHOPPING + PROX_SUPERMARKET + NUM_KINDERGARTEN +
                  NUM_CHILDCARE + NUM_BUS_STOPS + NUM_PRI_SCHS + 
                  NUM_TOURIST_SITES,
                data=resale_flat_train)
summary(price_mlr)

Call:
lm(formula = PRICE ~ AREA_SQ + storey_order + LEASE_M + PROX_CBD + 
    PROX_ELDERCARE + PROX_HAWKER + PROX_ISP + PROX_MRT + PROX_PARK + 
    PROX_TOP_PRI + PROX_SHOPPING + PROX_SUPERMARKET + NUM_KINDERGARTEN + 
    NUM_CHILDCARE + NUM_BUS_STOPS + NUM_PRI_SCHS + NUM_TOURIST_SITES, 
    data = resale_flat_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-364678  -44410   -1224   43893  361399 

Coefficients:
                    Estimate Std. Error  t value Pr(>|t|)    
(Intercept)       56418.2024  8071.0816    6.990 2.82e-12 ***
AREA_SQ            3470.3497    69.4165   49.993  < 2e-16 ***
storey_order      15899.2024   228.3414   69.629  < 2e-16 ***
LEASE_M              30.1038     0.2686  112.070  < 2e-16 ***
PROX_CBD            -16.6569     0.1621 -102.771  < 2e-16 ***
PROX_ELDERCARE       -8.8123     0.8041  -10.959  < 2e-16 ***
PROX_HAWKER         -20.6838     0.9481  -21.817  < 2e-16 ***
PROX_ISP            -17.2855     2.3869   -7.242 4.56e-13 ***
PROX_MRT            -16.5187     1.2924  -12.781  < 2e-16 ***
PROX_PARK             5.1464     1.1690    4.402 1.08e-05 ***
PROX_TOP_PRI          1.8706     0.2472    7.568 3.94e-14 ***
PROX_SHOPPING       -13.1984     1.3355   -9.883  < 2e-16 ***
PROX_SUPERMARKET     10.1508     3.1209    3.253  0.00115 ** 
NUM_KINDERGARTEN  10134.4235   515.2692   19.668  < 2e-16 ***
NUM_CHILDCARE     -3486.3416   245.9476  -14.175  < 2e-16 ***
NUM_BUS_STOPS      1051.2177   168.3720    6.243 4.35e-10 ***
NUM_PRI_SCHS      -6515.3464   333.8226  -19.517  < 2e-16 ***
NUM_TOURIST_SITES  4902.9376   316.1419   15.509  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 68670 on 23638 degrees of freedom
Multiple R-squared:  0.7191,    Adjusted R-squared:  0.7189 
F-statistic:  3559 on 17 and 23638 DF,  p-value: < 2.2e-16

Save Results:

write_rds(price_mlr, "data/model/price_mlr.rds" )

Predict Results (OLS) LM

lm_ols_pred <- predict.lm(price_mlr, 
                           resale_flat_test, 
                           x.var.name="X",
                           y.var.name="Y", 
                           local.w=1,
                           global.w=0)
lm_ols_pred_df <- as.data.frame(lm_ols_pred)
test_data_ols_lm <- cbind(resale_flat_test, lm_ols_pred_df)

Calculate Root Mean Square Error (RMSE)

rmse(test_data_ols_lm$PRICE, 
     test_data_ols_lm$lm_ols_pred)
[1] 82456.97

Visualize the Predicted Values (Non spatially weighted)

ggplot(data = test_data_ols_lm,
       aes(x = lm_ols_pred,
           y = PRICE)) +
  geom_point()

From a look at the results, it appears that the model is performing quite well. However, we will now introduce geographically weighted regression to see if we can improved the prediction model.

Geographically Weighted Regression

Determine Adaptive Bandwidth for GWR:

Need to convert into SP Object for Train Data

resale_flat_train_sp <- as_Spatial(resale_flat_train)
resale_flat_train_sp
class       : SpatialPointsDataFrame 
features    : 23656 
extent      : 11519.79, 42645.18, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 23
names       :   month,       town,          address,   FLAT, AREA_SQ, LEASE_M,   PRICE,         PROX_CBD,   PROX_ELDERCARE,      PROX_HAWKER,         PROX_ISP,         PROX_MRT,        PROX_PARK,     PROX_TOP_PRI,     PROX_SHOPPING, ... 
min values  : 2021-01, ANG MO KIO,   1 CHAI CHEE RD, 4 ROOM,      70,    6342,  250000, 999.393538715878, 1.9938461724e-05, 30.6031805185926,  2.996413259e-06, 22.6058582568769, 45.4199446964984, 65.2546212607695, 0.002762755353668, ... 
max values  : 2022-12,     YISHUN, 9B BOON TIONG RD, 4 ROOM,     145,   13972, 1370000, 19650.0691667807, 3301.63731683139, 2867.63031236184, 1778.93493479916, 2104.35522470249, 2411.71566711429, 10622.3747966971,  2321.60267410058, ... 

Need to convert into SP Object for Test Data

resale_flat_test_sp <- as_Spatial(resale_flat_test)
resale_flat_test_sp
class       : SpatialPointsDataFrame 
features    : 1847 
extent      : 11655.33, 42444.75, 28287.8, 48675.05  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 23
names       :   month,       town,            address,   FLAT, AREA_SQ, LEASE_M,   PRICE,         PROX_CBD,    PROX_ELDERCARE,      PROX_HAWKER,         PROX_ISP,         PROX_MRT,        PROX_PARK,     PROX_TOP_PRI,    PROX_SHOPPING, ... 
min values  : 2023-01, ANG MO KIO,          1 PINE CL, 4 ROOM,      74,    6193,  320000, 1029.97863472461, 0.000989228659617, 50.3736883264917, 1.5923797982e-05, 43.9758644103355, 46.0386442635826, 135.674873932238, 46.9043803614523, ... 
max values  : 2023-02,     YISHUN, 997C BUANGKOK CRES, 4 ROOM,     121,   13683, 1290000, 19403.6799155058,  3261.54359020376, 2720.70824174618,  1503.6341881287, 1999.88993064771, 2398.40352003288, 10429.3260312532, 2321.60267410058, ... 

Computing Bandwidth on Training Data

bw_adaptive <- bw.gwr(PRICE ~ AREA_SQ +
                  storey_order + LEASE_M +
                  PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
                  PROX_ISP + PROX_MRT + PROX_PARK + PROX_TOP_PRI + 
                  PROX_SHOPPING + PROX_SUPERMARKET + NUM_KINDERGARTEN +
                  NUM_CHILDCARE + NUM_BUS_STOPS + NUM_PRI_SCHS + 
                  NUM_TOURIST_SITES,
                  data=resale_flat_train_sp,
                  approach="CV",
                  kernel="gaussian",
                  adaptive=TRUE,
                  longlat=FALSE)

Export the model

write_rds(bw_adaptive, "data/model/bw_adaptive.rds")
bw_adaptive <- read_rds("data/model/bw_adaptive.rds")

Computing Bandwidth based on Test Data

bw_adaptive_test <- bw.gwr(PRICE ~ AREA_SQ +
                  storey_order + LEASE_M +
                  PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
                  PROX_ISP + PROX_MRT + PROX_PARK + PROX_TOP_PRI + 
                  PROX_SHOPPING + PROX_SUPERMARKET + NUM_KINDERGARTEN +
                  NUM_CHILDCARE + NUM_BUS_STOPS + NUM_PRI_SCHS + 
                  NUM_TOURIST_SITES,
                  data=resale_flat_test_sp,
                  approach="CV",
                  kernel="gaussian",
                  adaptive=TRUE,
                  longlat=FALSE)

Save the Test bandwidth model

write_rds(bw_adaptive_test, "data/model/bw_adaptive_test.rds")
bw_adaptive_test <- read_rds("data/model/bw_adaptive_test.rds")
bw_adaptive_complete <- bw_adaptive_test + bw_adaptive
bw_adaptive_complete
[1] 236

Run the Predict GWR Model

Something WRONG, come back to this later!

gwr_adaptive <- gwr.predict(formula = PRICE ~ AREA_SQ +
                  storey_order + LEASE_M +
                  PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
                  PROX_ISP + PROX_MRT + PROX_PARK + PROX_TOP_PRI + 
                  PROX_SHOPPING + PROX_SUPERMARKET + NUM_KINDERGARTEN +
                  NUM_CHILDCARE + NUM_BUS_STOPS + NUM_PRI_SCHS + 
                  NUM_TOURIST_SITES,
                  data=resale_flat_train_sp,
                  predictdata=resale_flat_test_sp,
                  bw=bw_adaptive_complete, 
                  kernel = 'gaussian', 
                  adaptive=TRUE,
                  longlat = FALSE,
                  )

Save the model

write_rds(gwr_adaptive, "data/model/gwr_adaptive.rds")

Read the model back

gwr_adaptive <- read_rds("data/model/gwr_adaptive.rds")
gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-18 23:08:28 
   Call:
   gwr.basic(formula = PRICE ~ AREA_SQ + storey_order + LEASE_M + 
    PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_ISP + PROX_MRT + 
    PROX_PARK + PROX_TOP_PRI + PROX_SHOPPING + PROX_SUPERMARKET + 
    NUM_KINDERGARTEN + NUM_CHILDCARE + NUM_BUS_STOPS + NUM_PRI_SCHS + 
    NUM_TOURIST_SITES, data = resale_flat_train_sp, bw = bw_adaptive, 
    kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  PRICE
   Independent variables:  AREA_SQ storey_order LEASE_M PROX_CBD PROX_ELDERCARE PROX_HAWKER PROX_ISP PROX_MRT PROX_PARK PROX_TOP_PRI PROX_SHOPPING PROX_SUPERMARKET NUM_KINDERGARTEN NUM_CHILDCARE NUM_BUS_STOPS NUM_PRI_SCHS NUM_TOURIST_SITES
   Number of data points: 23656
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-364678  -44410   -1224   43893  361399 

   Coefficients:
                       Estimate Std. Error  t value Pr(>|t|)    
   (Intercept)       56418.2024  8071.0816    6.990 2.82e-12 ***
   AREA_SQ            3470.3497    69.4165   49.993  < 2e-16 ***
   storey_order      15899.2024   228.3414   69.629  < 2e-16 ***
   LEASE_M              30.1038     0.2686  112.070  < 2e-16 ***
   PROX_CBD            -16.6569     0.1621 -102.771  < 2e-16 ***
   PROX_ELDERCARE       -8.8123     0.8041  -10.959  < 2e-16 ***
   PROX_HAWKER         -20.6838     0.9481  -21.817  < 2e-16 ***
   PROX_ISP            -17.2855     2.3869   -7.242 4.56e-13 ***
   PROX_MRT            -16.5187     1.2924  -12.781  < 2e-16 ***
   PROX_PARK             5.1464     1.1690    4.402 1.08e-05 ***
   PROX_TOP_PRI          1.8706     0.2472    7.568 3.94e-14 ***
   PROX_SHOPPING       -13.1984     1.3355   -9.883  < 2e-16 ***
   PROX_SUPERMARKET     10.1508     3.1209    3.253  0.00115 ** 
   NUM_KINDERGARTEN  10134.4235   515.2692   19.668  < 2e-16 ***
   NUM_CHILDCARE     -3486.3416   245.9476  -14.175  < 2e-16 ***
   NUM_BUS_STOPS      1051.2177   168.3720    6.243 4.35e-10 ***
   NUM_PRI_SCHS      -6515.3464   333.8226  -19.517  < 2e-16 ***
   NUM_TOURIST_SITES  4902.9376   316.1419   15.509  < 2e-16 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 68670 on 23638 degrees of freedom
   Multiple R-squared: 0.7191
   Adjusted R-squared: 0.7189 
   F-statistic:  3559 on 17 and 23638 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 1.114511e+14
   Sigma(hat): 68642
   AIC:  594066.5
   AICc:  594066.5
   BIC:  570755.2
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 208 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                            Min.     1st Qu.      Median     3rd Qu.       Max.
   Intercept         -1.5925e+07 -3.1195e+05  1.3656e+04  4.1936e+05 2.0284e+07
   AREA_SQ           -3.3266e+04  1.7885e+03  2.6753e+03  4.7928e+03 2.0665e+04
   storey_order       6.4706e+03  9.7706e+03  1.1832e+04  1.3552e+04 2.6041e+04
   LEASE_M           -2.3641e+02  1.6408e+01  2.8503e+01  4.2442e+01 6.2055e+01
   PROX_CBD          -2.1065e+03 -4.1283e+01 -1.5436e+01  2.7500e+01 1.6728e+03
   PROX_ELDERCARE    -8.0722e+02 -1.6512e+01  1.0508e+01  4.2494e+01 7.5826e+02
   PROX_HAWKER       -1.0470e+03 -3.3894e+01 -4.9925e+00  2.9135e+01 3.1824e+02
   PROX_ISP          -3.1913e+02 -4.1878e+01 -1.1588e+01  1.5884e+01 2.9390e+02
   PROX_MRT          -5.5292e+02 -8.9593e+01 -4.7872e+01 -9.9661e+00 5.4679e+02
   PROX_PARK         -7.3975e+02 -4.5062e+01 -1.3701e+01  1.5997e+01 1.0985e+03
   PROX_TOP_PRI      -1.7545e+03 -6.1904e+01 -1.2907e+01  2.3466e+01 2.1158e+03
   PROX_SHOPPING     -3.6330e+02 -3.9704e+01 -5.8207e+00  2.3372e+01 6.2958e+02
   PROX_SUPERMARKET  -2.8839e+02 -4.3438e+01 -4.0071e+00  2.7046e+01 4.0218e+02
   NUM_KINDERGARTEN  -4.0966e+04 -7.5051e+03 -2.6599e+03  2.4252e+03 2.8284e+04
   NUM_CHILDCARE     -1.4333e+04 -2.3495e+03 -2.7022e+01  2.0938e+03 1.6330e+04
   NUM_BUS_STOPS     -7.1101e+03 -1.3664e+03 -5.2787e+01  1.5321e+03 6.1972e+03
   NUM_PRI_SCHS      -3.1974e+04 -6.6428e+03 -4.8182e+02  4.4575e+03 2.8489e+04
   NUM_TOURIST_SITES -2.2698e+06 -6.5476e+04 -5.9590e+02  4.3428e+04 2.3308e+06
   ************************Diagnostic information*************************
   Number of data points: 23656 
   Effective number of parameters (2trace(S) - trace(S'S)): 949.4115 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 22706.59 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 568726.7 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 567928.1 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 551054.1 
   Residual sum of squares: 3.582523e+13 
   R-square value:  0.9096981 
   Adjusted R-square value:  0.9059222 

   ***********************************************************************
   Program stops at: 2023-03-18 23:15:41 

Predict GWR Adaptive (NO USE -> IT’S THE LM MODEL)

gwr_lm_pred <- predict.glm(gwr_adaptive$lm, 
                           resale_flat_test, 
                           x.var.name="X",
                           y.var.name="Y", 
                           local.w=1,
                           global.w=0)
gwrlm_pred_df <- as.data.frame(gwr_lm_pred)
test_data_gwr_lm <- cbind(resale_flat_test, gwrlm_pred_df)

Calculate Root Mean Square Error (RMSE)

rmse(test_data_gwr_lm$PRICE, 
     test_data_gwr_lm$gwr_lm_pred)
[1] 82456.97

Visualize the Predicted Values (Spatially weighted)

ggplot(data = test_data_gwr_lm,
       aes(x = gwr_lm_pred,
           y = PRICE)) +
  geom_point()

Geogprahical Random Forest Model

Extracting coordinates data

coords_train <- st_coordinates(resale_flat_train)
coords_test <- st_coordinates(resale_flat_test)

Dropping the Geometry column for training

resale_flat_train_no_geom <- resale_flat_train %>% 
  st_drop_geometry()

Getting the bandwidth

We will use the grf.bw method to obtain the bandwidth

grf_bw <- grf.bw(formula = PRICE ~ AREA_SQ +
                  storey_order + LEASE_M +
                  PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
                  PROX_ISP + PROX_MRT + PROX_PARK + PROX_TOP_PRI + 
                  PROX_SHOPPING + PROX_SUPERMARKET + NUM_KINDERGARTEN +
                  NUM_CHILDCARE + NUM_BUS_STOPS + NUM_PRI_SCHS + 
                  NUM_TOURIST_SITES,
                  dataset=resale_flat_train_no_geom,
                  coords=coords_train,
                  kernel="adaptive")